Contents

1 Introduction

Gene Set Enrichment Analysis is one of the most common tasks in the analysis of omic data, and is critical for biological interpretation. In the context of Epigenome Wide Association Studies, which typically rank individual cytosines according to the level of differential methylation, enrichment analysis of biological pathways is challenging due to differences in CpG/probe density between genes. Here we propose an empirical Bayes Gene Set Enrichment Analysis (ebGSEA) algorithm, which does not rank CpGs but genes according to the overall level of differential methylation of its CpGs/probes, allowing unbiased and sensitive detection of enriched pathways. ebGSEA serves as a useful GSEA tool for EWAS that use Illumina HM450k and EPIC beadarrays. For details please refer to our publications listed at the end of this tutorial(Dong et al. 2019).

2 Tutorial Example

2.1 Global test to identify genes showing overall differential methylation of CpGs

We used a HM450k buccal swab dataset(Teschendorff et al. 2015), which contains 104 buccal swab samples with smoking-pack-years as phenotype. The beta value matrix has been processed, with samples by column and probes by row.

load("/mnt/local-disk/data/zhutianyu/ebGSEA/trainBUC.rda")
dim(data.m)
## [1] 484272    104

Firstly, we do global test(Goeman et al. 2004) to rank genes according to overall differential methylation level of mapped CpGs. Here are the inputs for doGT function:

  • pheno.v: A vector of phenotype information, must be matched to columns of the input beta matrix.
  • data.m: A matrix of beta values with probes by row and samples by column. Missing values shoud be excluded.
  • model: The regression model for global test. Default is “linear”.
  • array: Array type for the input data. “450k” for Illumina HumanMethylation450 data and “850k” for Illumina MethylationEPIC data.
  • ncores: Number of cores used for parallel running. (default = 4)

Because the function will take a few miniutes to run, we have stored the output of doGT in the pacakge. You can get the result by data("sgtm")

library(ebGSEA)
## 
## sgt.m <-doGT(pheno.v,data.m,array="450k",ncores=20)
data("sgtm")

We can get the following matrix ordered by statistic for all genes tested:

dim(sgt.m)
## [1] 18618     5
head(sgt.m)
##                p-value Statistic  Expected   Std.dev #Cov
## 1545      4.964735e-07 11.851194 0.9708738 0.7611331   38
## 100313886 1.049926e-04 11.044014 0.9708738 1.0936538    5
## 386757    6.694333e-04 10.774705 0.9708738 1.3663415    1
## 1543      2.860083e-05 10.163787 0.9708738 0.8705077   35
## 239       1.952693e-04  9.980009 0.9708738 1.0607971   24
## 3426      7.105079e-04  9.920837 0.9708738 1.2614460    5

2.2 Pathway enrichment analysis with Wilcox Rank Sum Test and Known-Population Meian Test

Then we can apply doGSEAwt function to do pathway enrichment analysis with Wilcox test and Knwon-Population Median Test, here are the input parameters:

  • rankEID.m: The resulted matrix from doGT function, with genes by row and ranked by statistics from global test. Rownames of the matrix should be gene EntrezID.
  • ptw.ls: Lists of Gene EntrezID in each pathway of interest. You can get the 8567 biological terms from Molecular Signatures Database(Subramanian et al. 2005) by data("MSigDB-28Feb14-data").
  • ncores: Number of cores used for parallel running. (default = 4)
  • minN: For each pathway, the minium number of genes(i.e. available in the ranked gene list) to conduct GSEA. If less than this value, the p value of this pathway would be set 1. (default = 5)
  • adjPVth: Adjusted p value threshold to infer a pathway to be significantly enriched or not. P value was derived from Wilcoxon rank sum test and adjusted with BH method. (default = 0.05)
data("MSigDB-28Feb14-data")
topGSEA.lm <- doGSEAwt(rankEID.m = sgt.m, ptw.ls = listEZ.lv, ncores = 10, minN = 5,adjPVth = 0.05)
##  Running Wilcox Test and Known Population Median Test...
##  Done
## 'select()' returned 1:1 mapping between keys and columns

The result is a list of three objects:

  • Rank(P): A matrix showing enriched pathways ranked by adjusted wilcox test p values.
  • Rank(AUC): A matrix showing enriched pathways ranked by AUC.
  • Genestat: Lists of gene symbols in each enriched pathway. Each object contains the statistic and p-value from global test of each gene.

There are 2240 enriched pathway identified for this sample data.

summary(topGSEA.lm)
##           Length Class  Mode   
## Rank(P)   11200  -none- numeric
## Rank(AUC) 11200  -none- numeric
## Genestat   2240  -none- list

In the first of two objects we can view the following information for each enriched pathway rank by adjusted wilcox test p value or AUC:

  • nREP: Number of genes mapped in this pathway
  • AUC: Area under curve from wilcox test
  • P(WT): P value from Wilcox Test
  • P(KPMT): P value from Known Population Median Test)
  • adjP: Adjusted P value for each pathway, using BH method
head(topGSEA.lm$`Rank(P)`)
##                                              nREP       AUC        P(WT)
## CAGGTG_V$E12_Q6                              2314 0.5931273 4.412916e-48
## DODD_NASOPHARYNGEAL_CARCINOMA_UP             1504 0.6089612 4.822036e-45
## TGANTCA_V$AP1_C                              1059 0.6179386 1.946188e-38
## ONDER_CDH1_TARGETS_2_DN                       441 0.6730179 8.322979e-36
## JAEGER_METASTASIS_DN                          249 0.7272961 2.727236e-35
## MEISSNER_BRAIN_HCP_WITH_H3K4ME3_AND_H3K27ME3 1014 0.6057309 4.125484e-30
##                                                   P(KPMT)         adjP
## CAGGTG_V$E12_Q6                              6.763456e-34 3.780545e-44
## DODD_NASOPHARYNGEAL_CARCINOMA_UP             1.984042e-32 2.065519e-41
## TGANTCA_V$AP1_C                              2.268952e-31 5.557665e-35
## ONDER_CDH1_TARGETS_2_DN                      1.885097e-25 1.782574e-32
## JAEGER_METASTASIS_DN                         7.821026e-30 4.672847e-32
## MEISSNER_BRAIN_HCP_WITH_H3K4ME3_AND_H3K27ME3 1.959189e-19 5.890503e-27

From the results we can see the biological term nasopharynx carcinoma is highly enriched. Nasopharynx carcinoma is a cancer that is smoking related and the nasopharynx is exposed to smoke carcinogens.

The third list Genestat consists of the statistic and p-value from global test, of each gene in a specific enriched pathway. The genes in the first pathway:

head(topGSEA.lm$Genestat[[1]])
##             Pvalue Statistic
## MEF2C   0.03455165 3.0828456
## FAM126A 0.19456354 1.3593731
## SCN3B   0.60704648 0.7065586
## MAGI3   0.03416454 3.3049866
## RORC    0.01658052 4.5841419
## SYNCRIP 0.04887433 2.3875709

We can plot the AUC and adjP for each enriched pathway:

plot(x = topGSEA.lm$`Rank(P)`[,2], y = -log10(topGSEA.lm$`Rank(P)`[,5]), xlab ='AUC', ylab = '-log10(adjP)', main = 'AUC and adjP for each enriched pathway', pch = 21, bg = 'red')

2.3 Pathway enrichment analysis with Fisher’s Exact Test

Additionally, ebGSEA allows users to do GSEA on a group of specified CpGs/Genes with Fisher’s Exact Test, without the need of ranking genes.

If your input is a group of CpG, you may use selEIDfromSelCpG function first to derive a group of significant genes. Here are the input parameters:

  • selCpG.v: A vector of user selected CpGs.
  • allCpG.v: A vector of all CpGs the user select the CpGs from.
  • pvth: P-value threshold to infer the number of selected CpGs mapped to a gene is significant or not in a binomial test. (default = 0.3/length(selCpG.v))
  • array: Array type for the input CpGs. “450k” for Illumina HumanMethylation450 data and “850k” for Illumina MethylationEPIC data.

You can use the sample CpGs in the package derived from the same buccal swab dataset by data("sampleCpG"). This group of 40626 CpGs showed differential methylation pattern associated with smoking pack-years.

data("SampleCpG")
sigEID.ls <- selEIDfromSelCpG(selCpG.v = sampleCpG.v, allCpG.v = allCpG.v, array = "450k")
##  Mapping 450k probes to genes...
##  Done

The group of CpGs are significantly mapped to 255 genes.

summary(sigEID.ls)
##        Length Class  Mode     
## selEID  255   -none- character
## selPV   255   -none- numeric  
## allPV  9153   -none- numeric

Then you can apply doGSEAft function to do pathway enrichment analysis with Fisher’s Exact Test, here are the input parameters:

  • selEID.v: A vector of selected Entrez Gene ID.
  • ptw.ls: Lists of Gene EntrezID in each pathway of interest. You can get the 8567 biological terms from Molecular Signatures Database by data("MSigDB-28Feb14-data").
  • allEID.v: A vector of the universal set of Entrez Gene ID which you select genes from.
  • ncores: Number of cores used for parallel running. (default = 4)
  • minN: For each pathway, the minium number of genes(i.e. available in the ranked gene list) to conduct GSEA. If less than this value, the p value of this pathway would be set 1. (default = 5)
  • adjPVth: Adjusted p value threshold to infer a pathway to be significantly enriched or not. P value was derived from Wilcoxon rank sum test and adjusted with BH method. (default = 0.05)
topGSEAft.lm <- doGSEAft(selEID.v = sigEID.ls$selEID, ptw.ls = listEZ.lv, allEID.v = names(mapEIDto450k.lv), ncores = 1, adjPVth = 0.05)

The output of topGSEAft.lm consists of the following items:

  • Rank(P): A matrix showing enriched pathways ranked by adjusted Fisher’s Exact Test p values. “nREP” is the number of genes in the pathway, “nOVL” is the number of selected genes in the pathway, “OR” is the odds ratio of Fisher’s Exact Test, “P” is the p value of Fisher’s Exact Test, “adjP” is the adjusted p value of Fisher’s Exact Test (method=‘BH’), “Genes” is all the selected genes in the pathway.
  • Rank(OR): A matrix showing enriched pathways ranked by odds ratio. The columns are samely defined as in Rank(P).

There are 79 enriched pathway identified by fisher’s exact test, each with 6 features:

summary(topGSEAft.lm)
##          Length Class  Mode     
## Rank(P)  474    -none- character
## Rank(OR) 474    -none- character
head(topGSEAft.lm$`Rank(P)`)
##                                               nREP   nOVL OR                
## JAEGER_METASTASIS_DN                          "250"  "25" "8.774214905728"  
## CHARAFE_BREAST_CANCER_BASAL_VS_MESENCHYMAL_UP "113"  "16" "12.6237084427608"
## ONDER_CDH1_TARGETS_2_DN                       "442"  "26" "4.90504233115453"
## HATADA_METHYLATED_IN_LUNG_CANCER_UP           "365"  "22" "4.96807269041407"
## HOX_GENES                                     "64"   "10" "13.8598927020685"
## CAGGTG_V$E12_Q6                               "2314" "64" "2.40406322547187"
##                                               P                     
## JAEGER_METASTASIS_DN                          "8.37864316495949e-15"
## CHARAFE_BREAST_CANCER_BASAL_VS_MESENCHYMAL_UP "3.17153270879009e-12"
## ONDER_CDH1_TARGETS_2_DN                       "4.47046000236504e-10"
## HATADA_METHYLATED_IN_LUNG_CANCER_UP           "6.7330366346862e-09" 
## HOX_GENES                                     "1.5174248583798e-08" 
## CAGGTG_V$E12_Q6                               "1.90038100328723e-08"
##                                               adjP                  
## JAEGER_METASTASIS_DN                          "7.1779835994208e-11" 
## CHARAFE_BREAST_CANCER_BASAL_VS_MESENCHYMAL_UP "1.35852603581023e-08"
## ONDER_CDH1_TARGETS_2_DN                       "1.27661436134204e-06"
## HATADA_METHYLATED_IN_LUNG_CANCER_UP           "1.44204812123392e-05"
## HOX_GENES                                     "2.59995575234795e-05"
## CAGGTG_V$E12_Q6                               "2.71342734252695e-05"
##                                               Genes                                                                                                                                                                                                                                                                                                                                                                                        
## JAEGER_METASTASIS_DN                          "EHF KLK11 LAMB3 PKP3 CALML3 ALDH3B2 EXPH5 SFN S100A14 IRX4 ELMO3 FXYD3 GPX2 TRIM29 PDZK1IP1 RAB25 S100A2 TFAP2B AOPEP PRSS8 KRT5 SOX15 KRT15 LY6D C1orf116"                                                                                                                                                                                                                                 
## CHARAFE_BREAST_CANCER_BASAL_VS_MESENCHYMAL_UP "EHF PDZK1IP1 PRSS8 KRT5 C6orf132 C1orf116 PKP3 RAB25 TRIM29 FXYD3 KRT15 PTPN6 S100A14 PROM2 ELMO3 BLNK"                                                                                                                                                                                                                                                                                     
## ONDER_CDH1_TARGETS_2_DN                       "ELMO3 DDR1 CD9 IL1RN GJB5 S100A14 EXPH5 TRIM29 C1orf116 SLC7A5 KLK11 EHF SFN RPS6KA1 S100A2 ARHGAP25 KRT5 SOX15 FXYD3 PKP3 KRT8 KRT15 PRSS8 RAB25 LAMB3 IRX4"                                                                                                                                                                                                                               
## HATADA_METHYLATED_IN_LUNG_CANCER_UP           "HOXA9 NFAM1 REC8 FAM83F HLX HOXD12 PHKG1 PLCB2 OSM CMTM2 LSP1 CD93 RSPH6A NFIX TAGLN CYP2W1 HOXB4 HOXA7 PNMA1 TMEM204 CCDC140 SECTM1"                                                                                                                                                                                                                                                       
## HOX_GENES                                     "HLX IRX4 HOXA2 HOXC10 HOXA11 HOXD12 HOXB3 HOXA10 HOXA7 HOXA9"                                                                                                                                                                                                                                                                                                                               
## CAGGTG_V$E12_Q6                               "OSR1 PKP3 SPI1 ELMO3 CCDC140 PAX3 PAX7 NRG2 CD9 KRT15 FBXO2 CTBP1 SEMA4B ALX3 DDR1 TRPV3 FMNL1 SIM1 LSP1 MFAP4 OTX2 PRDM16 CDH7 HEYL C1orf210 FAM53B HOXA11 HOXA10 SMAD3 SFN ITPK1 HOXA7 HOXB3 SOX15 KHDRBS2 NFIX KRT8 GPT FGF1 LMO2 NR4A2 RAB25 BLNK EXPH5 GPX2 PLA2G3 LRRN4CL AGAP2 LY6G6C TNNI2 FAM83F TNS4 IRX4 SYT8 GCNT3 S100A16 S100A14 ACOT11 EHF CLDN15 WNT3A PPP1R16B SGIP1 PLCB2"

We can see that the lung cancer biological term is on the top of the list, which is known to be strongly related with smoking.

Then we can plot the odds ratio and adjPval for each enriched pathway.

plot(x = log2(as.numeric(topGSEAft.lm$`Rank(P)`[,3])), y = -log10(as.numeric(topGSEAft.lm$`Rank(P)`[,5])), xlab ='log2(OR)', ylab = '-log10(adjP)', main = 'OR and adjP for each enriched pathway', pch = 21, bg = 'red')

3 Session information

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS:   /usr/local/lib64/R/3.6.3/lib64/R/lib/libRblas.so
## LAPACK: /usr/local/lib64/R/3.6.3/lib64/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ebGSEA_0.1.0     BiocStyle_2.14.4
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4.6         compiler_3.6.3       BiocManager_1.30.10 
##  [4] bitops_1.0-6         tools_3.6.3          digest_0.6.25       
##  [7] bit_1.1-15.2         lattice_0.20-38      annotate_1.64.0     
## [10] evaluate_0.14        RSQLite_2.2.0        memoise_1.1.0       
## [13] pkgconfig_2.0.3      rlang_0.4.5          Matrix_1.2-18       
## [16] DBI_1.1.0            magick_2.3           yaml_2.2.1          
## [19] parallel_3.6.3       xfun_0.14            stringr_1.4.0       
## [22] knitr_1.28           IRanges_2.20.2       vctrs_0.2.4         
## [25] S4Vectors_0.24.4     grid_3.6.3           stats4_3.6.3        
## [28] bit64_0.9-7          globaltest_5.40.0    Biobase_2.46.0      
## [31] AnnotationDbi_1.48.0 survival_3.1-8       XML_3.99-0.3        
## [34] rmarkdown_2.2        bookdown_0.19        org.Hs.eg.db_3.10.0 
## [37] blob_1.2.1           magrittr_1.5         splines_3.6.3       
## [40] htmltools_0.4.0      BiocGenerics_0.32.0  kpmt_0.1.0          
## [43] xtable_1.8-4         stringi_1.4.6        RCurl_1.98-1.2

References

Dong, Danyue, Yuan Tian, Shijie C Zheng, and Andrew E Teschendorff. 2019. “EbGSEA: An Improved Gene Set Enrichment Analysis Method for Epigenome-Wide-Association Studies.” Bioinformatics 35 (18): 3514–6.

Goeman, Jelle J, Sara Van De Geer, Floor De Kort, and Hans C Van Houwelingen. 2004. “A Global Test for Groups of Genes: Testing Association with a Clinical Outcome.” Bioinformatics 20 (1): 93–99.

Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda G Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences of the United States of America 102 (43): 15545–50.

Teschendorff, Andrew E, Zhen Yang, Andrew Wong, Christodoulos P Pipinikas, Yinming Jiao, Allison Jones, Shahzia Anjum, et al. 2015. “Correlation of Smoking-Associated Dna Methylation Changes in Buccal Cells with Dna Methylation Changes in Epithelial Cancer.” JAMA Oncology 1 (4): 476–85.